Extending the Linear Model

Overview

  • Interactions
  • Nonlinear Terms
  • Some experiments in R
  • Potential Problems
  • \(K\)-Nearest Neighbors for continuous response variables

For further reading, see [JWHT], pp. 87-121.

Last time: The Linear Regression Model

In Linear Regression, we model \(f(X)\) with:

\[ Y = \beta_0 + \beta_1 X_1 +\beta_2 X_2 + \cdots + \beta_p X_p + \epsilon \]

Linear regression makes it easy to predict with \(\hat{f}\), and also to do inference:

  • Confidence intervals and p-values for the \(\beta_i\)s.
    • Interpreting multiple variables; conditioning.
  • Model fit using \(R^2\).
  • \(F\)-test for omnibus association

Interactions and Nonlinear Terms

Assumptions of the Linear Model

The model \(Y = \beta_0 + \beta_1 X_1 +\beta_2 X_2 + \cdots + \beta_p X_p + \epsilon\) requires some assumptions.

  • Additivity: The effect on \(Y\) of changing \(X_i\) is completely determined by its coefficient \(\beta_i\), and does not depend on the values of the other predictors \(X_j\), \(j\neq i\).
  • Linearity: The effect of changing \(X_i\) by one unit is always the same, regardless of the value of \(X_i\).

Relaxing the additive assumption

  • Consider the model sales ~ TV + radio.
    • What if increasing radio actually makes TV advertising more effective?
    • The TV slope should be greater for higher values of radio.
  • Replace the model \(Y = \beta_0 + \beta_1 X_1 +\beta_2 X_2 + \epsilon\) with \(Y = \beta_0 + \tilde{\beta}_1 X_1 +\beta_2 X_2 + \epsilon\), where \(\tilde{\beta}_1 = \beta_1 + \beta_3X_2\).
  • This is equivalent to adding an interaction term \(\beta_3X_1X_2\): \[ Y = \beta_0 + \beta_1 X_1 +\beta_2 X_2 + \beta_3 X_1X_2 + \epsilon \]

Coding Interactions in R

library(tidyverse)
Advertising <- read_csv("https://www.statlearning.com/s/Advertising.csv")

## Original model:
adMod1 <- lm(sales ~ TV + radio, data = Advertising)
## Model with interaction term:
adMod2 <- lm(sales ~ TV + radio + TV:radio, data = Advertising)
## Equivalent syntax using *:
adMod2a <- lm(sales ~ TV * radio, data = Advertising)

Original Model

summary(adMod1)

Call:
lm(formula = sales ~ TV + radio, data = Advertising)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.7977 -0.8752  0.2422  1.1708  2.8328 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.92110    0.29449   9.919   <2e-16 ***
TV           0.04575    0.00139  32.909   <2e-16 ***
radio        0.18799    0.00804  23.382   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.681 on 197 degrees of freedom
Multiple R-squared:  0.8972,    Adjusted R-squared:  0.8962 
F-statistic: 859.6 on 2 and 197 DF,  p-value: < 2.2e-16

Model with TV:radio interaction

summary(adMod2) ## adMod2a is exactly the same

Call:
lm(formula = sales ~ TV + radio + TV:radio, data = Advertising)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.3366 -0.4028  0.1831  0.5948  1.5246 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 6.750e+00  2.479e-01  27.233   <2e-16 ***
TV          1.910e-02  1.504e-03  12.699   <2e-16 ***
radio       2.886e-02  8.905e-03   3.241   0.0014 ** 
TV:radio    1.086e-03  5.242e-05  20.727   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9435 on 196 degrees of freedom
Multiple R-squared:  0.9678,    Adjusted R-squared:  0.9673 
F-statistic:  1963 on 3 and 196 DF,  p-value: < 2.2e-16

The Hierarchical Principle

  • The coefficients for TV and radio model the main effects.
  • The coefficient for TV:radio models the interaction.
  • Hierarchical principle: If we include an interaction in a model, we should also include the main effects, even if the p-values associated with their coefficients are not significant.
    • Reason: the interaction term models a variable slope (see above).

Example: Quantitative:Categorical interaction

RStudio carpentry exercise.

Dealing with nonlinearity

library(ISLR2)
ggplot(Auto, aes(x = horsepower, y = mpg)) + geom_point()

Polynomial Regression

  • If the association appears non-linear, you can add higher-power terms.
  • Linear model: \(Y = \beta_0 + \beta_1 X + \epsilon\)
  • Quadratic model: \(Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon\)
carLin <- lm(mpg ~ horsepower, data = Auto)
carQuad <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)

Linear model summary

summary(carLin)

Call:
lm(formula = mpg ~ horsepower, data = Auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.5710  -3.2592  -0.3435   2.7630  16.9240 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 39.935861   0.717499   55.66   <2e-16 ***
horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.906 on 390 degrees of freedom
Multiple R-squared:  0.6059,    Adjusted R-squared:  0.6049 
F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

Quadratic model summary

summary(carQuad)

Call:
lm(formula = mpg ~ horsepower + I(horsepower^2), data = Auto)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.7135  -2.5943  -0.0859   2.2868  15.8961 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)     56.9000997  1.8004268   31.60   <2e-16 ***
horsepower      -0.4661896  0.0311246  -14.98   <2e-16 ***
I(horsepower^2)  0.0012305  0.0001221   10.08   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.374 on 389 degrees of freedom
Multiple R-squared:  0.6876,    Adjusted R-squared:  0.686 
F-statistic:   428 on 2 and 389 DF,  p-value: < 2.2e-16

Plot the models

ggplot(Auto, aes(x = horsepower, y = mpg)) +
  geom_point() +
  geom_smooth(method = lm, formula = y ~ x, color = "green") +
  geom_smooth(method = lm, formula = y ~ x + I(x^2), color = "purple")

Potential Problems

Model Assumptions

\[ Y = \beta_0 + \beta_1 X_1 +\beta_2 X_2 + \cdots + \beta_p X_p + \epsilon \]

  • \(\epsilon\) is a normal random variable with \(E(\epsilon) = 0\), \(\text{Var}(\epsilon) = \sigma^2\) (a constant).
  • The \(\epsilon\) term is independent of the \(X_i\)’s.
  • Homoscedasticity: The variability of the residuals should not change for different values of \(X\).
  • Likewise, the variability of the residuals should not change for different fitted values of \(Y\). (Plot residuals vs. fitted values.)

Residuals vs. Fit: Patterns are bad

plot(carLin, which = 1)

Residuals vs. Fit: Heteroscedasticity is bad

plot(carQuad, which = 1)

Residuals should be normally distributed

hist(carLin$residuals)

Residuals are normal iff the Normal Q-Q plot is a straight line

plot(carLin, which = 2)

Outliers

  • Points that fall well off the regression line are (bivariate) outliers.
  • You can see these in the residual plots.
  • Rule of thumb: observations with standardized residuals greater than \(3\) in absolute value are outliers.
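The rule of thumb above can be checked in R with rstandard(). A minimal sketch on simulated data (the data and the planted outlier are invented for illustration, not from the slides):

```r
## Flag standardized residuals with absolute value > 3 (simulated data).
set.seed(1)
x <- 1:50
y <- 2 + 0.5 * x + rnorm(50, sd = 1)
y[25] <- y[25] + 10                  # plant one outlier
fit <- lm(y ~ x)
std_res <- rstandard(fit)            # internally studentized residuals
which(abs(std_res) > 3)              # flags observation 25
```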

Leverage

  • Points at the extremes of the \(X\) range are more influential on the regression coefficients.
  • This extremity can be quantified by the leverage statistic: \[ h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^n (x_j - \bar{x})^2} \]
  • There’s a generalization of this formula for multiple predictors.
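One way to see that the formula above matches what R computes: hatvalues() returns the leverage of each observation, and for simple regression it agrees with the formula exactly. A sketch on simulated data (invented for illustration):

```r
## Check the simple-regression leverage formula against hatvalues().
set.seed(1)
x <- c(rnorm(20), 10)                # the last point is far from the rest
y <- 1 + 2 * x + rnorm(21)
fit <- lm(y ~ x)
h_formula <- 1 / length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)
all.equal(unname(hatvalues(fit)), h_formula)   # TRUE
which.max(hatvalues(fit))                      # the extreme point, observation 21
```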

Correlated predictors: Collinearity

  • If some of the predictors are correlated with each other, they are called collinear.
  • This is a problem because it can be hard to discern which predictor is having the effect.
    • The minimum RSS may be (close to) non-unique.
  • Credit example
    • Predict balance from age and limit
    • Predict balance from rating and limit
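The Credit example requires the ISLR2 data, but the effect of collinearity is easy to reproduce with simulated predictors (invented for illustration). The variance inflation factor (VIF) for \(X_1\), computed by hand as \(1/(1 - R^2_{X_1 \mid X_2})\), quantifies the problem:

```r
## Collinear predictors make individual coefficients hard to pin down.
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)        # x2 is nearly a copy of x1
y  <- 1 + 2 * x1 + rnorm(n)
fit <- lm(y ~ x1 + x2)
## Variance inflation factor for x1, computed by hand:
vif_x1 <- 1 / (1 - summary(lm(x1 ~ x2))$r.squared)
vif_x1                               # far above the common cutoff of 5-10
```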

KNN for continuous response variables

K-Nearest Neighbors

Recall: You can predict the probability that a categorical response \(Y\) will be \(j\) based on a value \(x_0\) of the explanatory variable: \[ \text{Pr}(Y = j \mid X = x_0) = \frac{1}{K} \sum_{i \in \mathcal{N}_0} I(y_i = j) \]

  • The sum is taken over a “neighborhood” of nearby points.

Similarly, you can predict the value of a continuous response \(Y\):

\[ \hat{f}(x_0) = \frac{1}{K} \sum_{i \in \mathcal{N}_0} y_i \]

  • The predicted \(y\)-value at \(x_0\) is simply the average of the \(K\)-nearest \(y\)-values.
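The averaging formula above is short enough to implement directly. A minimal sketch (the function name knn_predict and the toy data are invented for illustration):

```r
## KNN regression: predict y at x0 as the mean of the K nearest y-values.
knn_predict <- function(x_train, y_train, x0, K) {
  nbrs <- order(abs(x_train - x0))[1:K]   # indices of the K nearest points
  mean(y_train[nbrs])
}
x_train <- c(1, 2, 3, 10, 11)
y_train <- c(1, 2, 3, 10, 11)
knn_predict(x_train, y_train, x0 = 2, K = 3)   # mean of y at x = 1, 2, 3, i.e. 2
```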

Flexibility

  • Recall, larger values of \(K\) make KNN less flexible.
  • Generally, KNN is more flexible than linear regression.